Implement startup_order and stop_criteria#2714
Conversation
|
Do you plan to also support |
|
@peterschmidt85 in a separate PR |
|
and then we can update the NCCL tests example |
| if run.run_spec.merged_profile.stop_criteria != StopCriteria.MASTER_DONE: | ||
| return False | ||
| for job in run.jobs: | ||
| if job.job_spec.job_num == 0 and job.job_submissions[-1].status == JobStatus.DONE: |
There was a problem hiding this comment.
(nit) Can also check for termination_reason == JobTerminationReason.DONE_BY_RUNNER to terminate the run faster, without waiting for the terminating -> done master job transition. See line 241
There was a problem hiding this comment.
I'm not sure if we want to terminate the run before the master is really done.
| class StartupOrder(str, Enum): | ||
| ANY = "any" | ||
| MASTER_FIRST = "master-first" | ||
| WORKERS_FIRST = "workers-first" |
There was a problem hiding this comment.
(nit) I'm not sure about calling non-master nodes "workers", because the master node is also a "worker" - it performs the same work other nodes do.
I can suggest to use "secondary" (secondary-first) or avoid any names (master-last). Although we might still need a name to use in the code
There was a problem hiding this comment.
master/worker is standard terminalogy used for pytorch, mpi, etc. Let's not reinvent.
A part of #2467
The PR introduces new run configuration properties:
startup_orderallows specifying the order in which master and workers jobs are started.stop_criteriaallows specifying the criteria determining when a multi-node run should be considered finished.They simplify running mpirun with
dstack:Other multi-node tasks such as iperf may require
startup_order: master-first. Most such as pytorch will work with the defaultstartup_order: any.TODO: